ABSTRACT 

Local increases in fundamental frequency (F0) and large integrals 
of energy in the syllabic nucleus are known to be among the best acoustical 
correlates of stress. Major syntactic constituents have been shown 
to have archetype rapid-rise-then-gradual-fall F0 contours, with the 
rise into the maximum F0 often associated with the first stressed 
syllable in the constituent. An automatic procedure for detecting 
constituent boundaries and maximum F0 positions in constituents (Lea, W. A. 
(1973), An Approach to Syntactic Recognition without Phonemics, IEEE 
Trans. Audio and Electroacoustics, AU-21, No. 3), and sonorant energy 
and F0 functions, provided input data for an algorithm for locating 
stressed syllables. The first stressed syllable of a constituent 
was associated with a high-energy-integral portion near the rising F0 
into maximum F0 position. Other stressed syllables were associated with 
high-energy-integral portions near local increases in F0 above a 
steadily-falling "archetype line" from the maximum F0 position to the 
end of the constituent. For over 400 seconds of speech, including 
written texts, and questions, commands, and declarations for man-machine 
interaction (involving fifteen talkers), over 85% of all syllables 
perceived as stressed by a panel of listeners were correctly located. 
AN ALGORITHM 
FOR LOCATING STRESSED SYLLABLES 
IN CONTINUOUS SPEECH 

Wayne A. Lea 

An algorithm for locating stressed syllables from prosodic 
features of energy and fundamental frequency has been devised. It is 
based on local increases in fundamental frequency, and large integrals 
of energy within the syllabic nucleus, being the most reliable acoustic 
correlates of stress. This algorithm also incorporates adjustments based 
on the most common ("archetype") fundamental frequency contours within the 
grammatical phrases and clauses of connected speech. 

Connected speech texts whose stress patterns were studied included 
a paragraph of the Rainbow Script read by six talkers, a paragraph 
composed of only monosyllabic words ("Monosyllabic Script") read by two talkers, 
and 31 spontaneous sentences intended for man-computer interaction, which had 
been recorded by nine talkers involved in the ARPA Speech Understanding 
Research Program. In a companion study reported on in another paper at 
this meeting, a panel of listeners repeatedly heard these spoken texts 
until they could provide judgments as to which syllables were stressed, 
unstressed, or reduced. 

The spoken scripts were processed through an autocorrelation 
algorithm for fundamental frequency tracking (or "pitch" tracking), 
and through an algorithm which provided a so-called "sonorant" energy 
function, which gives the speech energy within the frequency range of 
60 Hz to 3000 Hz. This sonorant energy function should give high energy 
values within sonorant syllabic nuclei, while giving lower values 
during obstruents. 
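
As a rough illustration of the pitch tracking step, the following is a 
minimal sketch of an autocorrelation estimate of fundamental frequency 
for a single frame of speech. It is not the tracker used in this study. 
The function name, the 60 Hz to 400 Hz search range, and the frame length 
in the usage lines are assumptions chosen only for the example. 

```python
import math


def autocorr_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame from the peak of its autocorrelation.

    Returns None for a silent frame or when no positive peak is found.
    The search range f0_min..f0_max is an illustrative assumption.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]          # remove any DC offset
    if sum(s * s for s in x) == 0.0:
        return None
    lag_min = int(sample_rate / f0_max)    # shortest period considered
    lag_max = min(int(sample_rate / f0_min), n - 1)
    best_lag, best_r = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag
        r = sum(x[i] * x[i + lag] for i in range(n - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag is None:
        return None
    return sample_rate / best_lag


# Usage: a synthetic 100 Hz sine sampled at 8000 Hz, 50 ms frame
frame = [math.sin(2 * math.pi * 100 * i / 8000) for i in range(400)]
estimate = autocorr_f0(frame, 8000)
```

A practical tracker would also need a voicing decision and smoothing 
across frames, which this sketch leaves out. 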

The first slide shows a stylized plot of fundamental frequency, on 
a logarithmic or eighth-tone scale, and a corresponding plot of sonorant 
energy on a dB scale. The algorithm then operates on this data as 
follows. First, as the next slide shows, the connected speech is 
segmented into sentences and major grammatical constituents by an 
algorithm for detecting phrase boundaries at the bottoms of substantial 
fall-rise "valleys" in fundamental frequency contours (Lea, 1971, 1972, 
1973). The increasing fundamental frequency near the beginning of each 
constituent is assumed to be attributable to the first stressed syllable 
or "HEAD" of the constituent, as shown on the next slide. A portion of 
the speech which is high in energy with increasing fundamental frequency 
values, and which is bounded by points where the energy dips 
5 dB or more, is asserted to be the stressed nucleus of this HEAD 
syllable. This is shown by the blue-tinted portions in this slide. 
Previous studies have shown that this stress-induced initial rise in 
fundamental frequency in a constituent is usually followed by a gradual 
fall in fundamental frequency, which may be approximated by a straight 
line on the logarithmic frequency scale. As shown in the next slide, 
the "archetype line" steadily drops in eighth-tone values from the maximum 
fundamental frequency in the constituent down to the low value at the 
end of the constituent. Other stressed syllables in the constituent are 
expected to be accompanied by local increases in fundamental frequency - 
increases which make the fundamental frequency contour locally rise 
above the archetype line. Thus, even though fundamental frequency may 
not be rising absolutely at such stressed syllables, the fact that it is 
not falling at its usual rate can be a cue to the presence of a stressed 
syllable. The stressed syllable is again located within a high-energy 
region bounded by 5 dB dips in energy, as shown by the new yellow-tinted 
portions on the slide. 
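
The archetype line step can be sketched as follows, under stated 
assumptions. The sketch takes one constituent's fundamental frequency 
contour, draws a straight line from the maximum to the final value on an 
eighth-tone scale, and reports the frames where the contour lies above 
that line. The conversion assumes 48 eighth tones per octave. The 50 Hz 
reference, the 1 eighth-tone margin, and all function names are 
hypothetical. The 5 dB energy test is not included. 

```python
import math


def hz_to_eighth_tones(f0_hz, ref_hz=50.0):
    """Convert Hz to eighth tones above ref_hz (assumes 48 per octave)."""
    return 48.0 * math.log2(f0_hz / ref_hz)


def archetype_line(f0_et):
    """Straight line from the F0 maximum to the last F0 value.

    f0_et: non-empty list of F0 values for one constituent, in eighth
    tones, one value per frame.
    Returns (index_of_maximum, line_values_from_that_index_onward).
    """
    peak = max(range(len(f0_et)), key=lambda i: f0_et[i])
    last = len(f0_et) - 1
    if peak == last:
        return peak, [f0_et[peak]]
    slope = (f0_et[last] - f0_et[peak]) / (last - peak)
    line = [f0_et[peak] + slope * (i - peak) for i in range(peak, last + 1)]
    return peak, line


def regions_above_line(f0_et, margin=1.0):
    """Frame spans after the F0 maximum where F0 exceeds the line.

    margin (eighth tones) is an assumed tolerance, not a published value.
    """
    peak, line = archetype_line(f0_et)
    regions, start = [], None
    for offset, expected in enumerate(line):
        i = peak + offset
        above = f0_et[i] - expected > margin
        if above and start is None:
            start = i
        elif not above and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(f0_et) - 1))
    return regions


# Usage: the contour falls 2 eighth tones per frame except at frame 4
contour = [10, 20, 18, 16, 17, 12, 10, 8]
print(regions_above_line(contour))  # [(4, 4)]
```

Note that at frame 4 the contour rises only one eighth tone in absolute 
terms, but it stands three eighth tones above the falling line, which is 
the kind of relative increase described above. 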

Detailed descriptions of this algorithm are available in published 
reports (Lea, 1973; Lea, Medress, and Skinner, 1973). The next slide 
shows the overall comparison between the algorithmically located stressed 
syllables and the listeners' perceptions of stressed syllables. For 
each text, and with results pooled for talkers, the table here gives the 
percentages of all syllables perceived as stressed (by two or more listeners) 
that the algorithm correctly located within the high-energy portions of 
speech. Occasionally the algorithm located a stretch of speech that did 
not enclose any syllable perceived as stressed by the listeners. Dividing 
the number of such false locations by the total number of algorithmically 
located portions gives the percentage of all locations that were false. 
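
The two scores just defined can be computed as in the short sketch 
below. Representing each perceived stressed syllable by a single time 
point (for example, the center of its nucleus) is an assumption made for 
the example, as are the function and variable names. 

```python
def score_locations(located, stressed_times):
    """Return (percent correctly located, percent false locations).

    located: list of (start, end) spans found by an algorithm, in seconds.
    stressed_times: one time point per syllable perceived as stressed
    (a simplifying assumption for this sketch).
    """
    hits = sum(
        1 for t in stressed_times
        if any(start <= t <= end for start, end in located)
    )
    false_spans = sum(
        1 for start, end in located
        if not any(start <= t <= end for t in stressed_times)
    )
    pct_correct = 100.0 * hits / len(stressed_times) if stressed_times else 0.0
    pct_false = 100.0 * false_spans / len(located) if located else 0.0
    return pct_correct, pct_false


# Usage: two of four stressed syllables are enclosed, and one of the
# three located spans encloses no stressed syllable
spans = [(0.10, 0.30), (0.50, 0.70), (0.90, 1.00)]
stressed = [0.20, 0.60, 1.20, 1.50]
pct_correct, pct_false = score_locations(spans, stressed)
```

Note that the two percentages use different denominators: the first is 
taken over perceived stressed syllables, the second over located 
portions. 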

While scores varied somewhat from text to text and talker to talker, 
the overall average of 86% correct location of stressed syllables is 
very encouraging. Scores for the Rainbow Script read by six talkers 
ranged from 75% to 95%. Results for only two talkers reading the Rainbow 
Script are shown pooled here, for ease of direct comparison with results 
by the same two talkers reading the Monosyllabic Script. The Monosyllabic 
Script, with its fewer reduced syllables and more prominent stresses 
on monosyllabic content words, yielded quite high scores. The spontaneous 
ARPA Sentences, which were more monotone and which gave some difficulties 
to the constituent boundary detection algorithm, showed lower stressed 
syllable location scores. False locations resulted from falsely 
detected syntactic constituent boundaries and "borderline" cases of 
syllables perceived as stressed by at least one individual listener. 
Some of the failures to locate stressed syllables resulted from lack of 
fundamental frequency increases on some stressed syllables. A few 
failures resulted from more than one stressed syllable being within 
the initial portion of the constituent that has increasing fundamental 
frequency. The ultimate use of a stressed syllable location algorithm 
will determine whether false alarms or failures to locate stressed 
syllables are the least desirable errors. 

To further evaluate the effectiveness of this archetype contour 
algorithm for locating stressed syllables, these results were compared 
with results in stressed syllable location by other procedures. The next 
slide shows one simple procedure which finds all dips and peaks in 
the sonorant energy function and delimits syllabic nuclei as all contiguous 
points within 5 dB of the maximum intensity value in each high-intensity 
"chunk" or syllable. Then, those chunks (or syllabic nuclei) that have 
a minimum duration of 100 ms are declared to be stressed. 
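
A minimal sketch of this chunk-duration procedure follows. It assumes 
the sonorant energy is sampled every 10 ms, and it treats a peak as a 
chunk maximum only when no higher value is reached before the energy 
falls 5 dB below that peak. That peak rule, the frame step, and the 
function names are assumptions made for the example. 

```python
def nuclei_from_energy(energy_db, drop_db=5.0):
    """Spans (start, end) of frames within drop_db of each chunk maximum."""
    spans = []
    n = len(energy_db)
    for p in range(n):
        # A local peak: higher than the left neighbor, not lower than the right
        left_ok = p == 0 or energy_db[p] > energy_db[p - 1]
        right_ok = p == n - 1 or energy_db[p] >= energy_db[p + 1]
        if not (left_ok and right_ok):
            continue
        floor = energy_db[p] - drop_db
        lo = p
        while lo > 0 and energy_db[lo - 1] >= floor:
            lo -= 1
        hi = p
        while hi < n - 1 and energy_db[hi + 1] >= floor:
            hi += 1
        # Skip minor peaks that sit inside the span of a higher peak
        if max(energy_db[lo:hi + 1]) == energy_db[p] and (lo, hi) not in spans:
            spans.append((lo, hi))
    return spans


def stressed_chunks(energy_db, frame_ms=10.0, drop_db=5.0, min_ms=100.0):
    """Nuclei lasting at least min_ms are declared stressed."""
    return [
        (lo, hi)
        for lo, hi in nuclei_from_energy(energy_db, drop_db)
        if (hi - lo + 1) * frame_ms >= min_ms
    ]


# Usage: a 120 ms nucleus followed by a 30 ms nucleus, 10 ms frames
energy = ([30] * 3 + [50, 52, 54, 55, 55, 55, 54, 53, 52, 51, 50, 50]
          + [30] * 3 + [52, 55, 53] + [30] * 3)
print(stressed_chunks(energy))  # [(3, 14)]
```

A stretch of uniformly low energy can also be returned as a span, so in 
practice a floor on absolute energy would be needed before applying the 
duration test. 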

Another simple subroutine, shown in the next slide, locates all 
portions of speech where, for 100 ms or longer, fundamental frequency 
does not decrease more than one eighth tone per ten milliseconds 
(this is sort of a relaxed form of a process of finding regions where 
fundamental frequency is steadily rising, or at least not falling rapidly). 
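
This subroutine might be sketched as below, assuming the fundamental 
frequency contour has already been converted to eighth tones and sampled 
every 10 ms, with None marking unvoiced frames. The handling of unvoiced 
frames and the function name are assumptions made for the example. 

```python
def non_falling_regions(f0_et, frame_ms=10.0, max_drop=1.0, min_ms=100.0):
    """Spans where F0 never falls faster than max_drop eighth tones per 10 ms.

    f0_et: F0 per frame in eighth tones, or None for unvoiced frames.
    Only spans lasting at least min_ms are returned.
    """
    regions = []
    start = None
    allowed = max_drop * frame_ms / 10.0   # allowed fall per frame

    def close(end):
        # Keep the run start..end if it is long enough
        if start is not None and (end - start + 1) * frame_ms >= min_ms:
            regions.append((start, end))

    for i, value in enumerate(f0_et):
        if value is None:
            close(i - 1)
            start = None
        elif start is None:
            start = i
        elif f0_et[i - 1] - value > allowed:
            # F0 fell too fast: end the current run, begin a new one here
            close(i - 1)
            start = i
    close(len(f0_et) - 1)
    return regions


# Usage: 11 slowly changing frames, then a 4 eighth-tone fall
contour = [40, 41, 42, 43, 44, 44, 44, 43.5, 43, 42.5, 42,
           38, 37.5, None, 30, 31]
print(non_falling_regions(contour))  # [(0, 10)]
```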



The next slide shows that the location of stressed syllables from 
durations of high-intensity chunks works surprisingly well in read 
texts with sharply contrasting stress levels, such as the Monosyllabic 
Script, but it is not as effective in more complicated read texts such 
as the Rainbow Script or in spontaneous speech such as the ARPA Sentences. 
Lowest percentages of correct location and highest percentages of false 
alarms occur for the spontaneous ARPA sentences. The next slide shows 
that regions of increasing fundamental frequency are also less reliably 
related to stressed syllables in such sentences, and generally give 
poorer performance even in the Monosyllabic Script. The archetype- 
contour algorithm obviously performs better than either of these two 
simpler algorithms, particularly for spontaneous speech. The next slide 
summarizes relative performance of the algorithms, showing that about 
^0% more stressed syllables are correctly located and about 10% fewer 
false alarms occur for the archetype-contour algorithm. 

The last slide shows how stressed syllable location by the algorithms 
is affected by the type of sentence spoken (for the ARPA Sentences). 
For each algorithm, false alarms (shown within the orange boxes) are 
most frequent in yes/no questions. As shown within the yellow bands, 
the lowest correct location score from chunk durations occurs in 
yes/no questions, while the highest correct location score from 
increases in fundamental frequency occurs in yes/no questions. This 
suggests the value of combining the two types of cues to improve 
success in stressed syllable location, such as is done in the archetype- 
contour algorithm. 

In general, it is apparent that fairly accurate procedures are 
available for locating stressed syllables in continuous speech, particularly 
for read texts with sharp stress contrasts. Even the simplest procedures 
can locate on the order of 75% or more of the stressed syllables, but 
complex algorithms seem to be approaching 95% location with on the order 
of 20% false alarms. Further improvements now being implemented include 
other combinations of energy and fundamental frequency cues, and the incor- 
poration of confidence measures to assess just how sure each algorithm is 
that each portion of speech is or is not a stressed syllable. Further 
studies will be conducted using designed speech texts which isolate effects 
that sentence type, constituent structure, different lexical insertions, and 
phonetic content have on the location of stressed syllables. 